Project background:

This study project is a repeat of work performed at Cornell University by Myle Ott, Yejin Choi, Claire Cardie and Jeffrey T. Hancock. Their paper Finding Deceptive Opinion Spam by Any Stretch of the Imagination details their methodology.

They collected, as a result of their methodology, 1600 reviews, of which 800 were deceptive, and 800 were truthful (i.e. written by an actual hotel guest). The deceptive reviews were created under contract with human workers, who were given a minute to write the review, had to live in the US, etc. 400 of each set of reviews were positive, and 400 were negative, leading to the dataset being examined:

This project will use a Support Vector Machine (an SVM) which is a fancy way of saying this project will calculate a place in the data where it can cleave spam from non-spam. In mathematical terms, we will calculate a hyperplane through a hyperspace defined by dimensions such as word counts, and parts-of-speech counts (nouns, adjective, etc.)

The project is written in one of my favorite programming languages, the statistical software language known as R. If you want to follow along, click on the code button to see the software in action.

Exploratory Data Analysis:

The first part of every data science project is to read in the data and “explore” it, preferably using graphics to gain insight.

Initialize and load libraries

################################################################################
### Initialize environment
################################################################################
rm(list=ls())
library(tidyverse)
library(gridExtra) #viewing multiple plots together

 # Text Mining Packages
library(tidytext)
library(tokenizers)
library(wordcloud2) #creative visualizations
library(spacyr)
spacy_initialize()

# Graphics Packages
library(ggthemes)
library(moments)
library(ggplot2)
library(scales)
library(knitr) # for dynamic reporting
library(kableExtra) # create a nicely formated HTML table
library(formattable) # for the color_tile function

publication_theme <- function() {
    theme_economist() +
    theme(text=element_text(family="Rockwell"),
          plot.title = element_text(family="Rockwell", size=12),
          panel.grid.minor = element_blank(),
          strip.text.x = element_text(size = 18, colour = "black")
    )
}

Project_Dir <- "/Users/amkhosla/Desktop/Statistics/Projects/Hotel_Reviews/code"
setwd(Project_Dir)

Read and label the hotel reviews

Once all the programming libraries and environment have been loaded, we need to read in the reviews (one per file) and label them as truthful or deceptive. Here’s a look at 4 sample reviews - for each category - from the resulting database of 1600 reviews.

training_path = '../input/op_spam_train/'
training_files <- list.files(path=training_path, pattern="*.txt", full.names=TRUE, recursive=TRUE)
if (length(training_files) != 1600) { 
   stop (paste("Couldn't locate input training files at:",  
                print(getwd()), print(training_path)))
}
training.df <- as.tibble(as.data.frame(training_files, 
                                       col.names = c("Training_File"),
                                       stringsAsFactors = FALSE))
Truthfulness Likes Hotel_Reviews
deceptive negative We stayed at the Schicago Hilton for 4 days and 3 nights for a conference. I have to say, normally I am very easy going about amenities, cleanliness, and the like…however our experience at the Hilton was so awful I am taking the time to actually write this review. Truly, DO NOT stay at this hotel. When we arrived in our room, it was clear that the carpet hadn’t been vacuumed. I figuered, “okay, it’s just the carpet.” Until I saw the bathroom! Although the bathroom had all the superficial indicators of housekeeping having recently cleaned (i.e., a paper band across the toilet, paper caps on the drinking glasses, etc., it was clear that no ACTUAL cleaning took place. There was a spot (probably urine!) on the toilet seat and, I kid you not, the remnants of a lip-smudge on the glass. I know people who have worked many years in the hotel industry and they always warned that lazy housekeeping will make things “appear” clean but in fact they make no effort to keep things sanitary. Well, the Hilton was proof. I called downstairs and complained, and they sent up a chambermaid hours later. Frankly, I found the room disgusting. The hotel itself, outside the rooms, was cavernous and unwelcoming, with an awful echo in the lobby area that created a migraine-inducing din. Rarely have I been so eager to leave a place as this. When I got home, I washed all my clothes whether I had worn them or not, such was the skeeviness of our accomodations. Please, do yourself a favor and stay at a CLEAN hotel.
truth negative My $200 Gucci sunglasses were stolen out of my bag on the 16th. I filed a report with the hotel security and am anxious to hear back from them. This was such a disappointment, as we liked the hotel and were having a great time in Chicago. Our room was really nice, with 2 bathrooms. We had 2 double beds and a comfortable hideaway bed. We had a great view of the lake and park. The hotel charged us $25 to check in early (10am).
deceptive positive My husband and I satayed for two nights at the Hilton Chicago,and enjoyed every minute of it! The bedrooms are immaculate,and the linnens are very soft. We also appreciated the free wifi,as we could stay in touch with friends while staying in Chicago. The bathroom was quite spacious,and I loved the smell of the shampoo they provided-not like most hotel shampoos. Their service was amazing,and we absolutely loved the beautiful indoor pool. I would recommend staying here to anyone.
truth positive Stayed here October 31 through November 5 for a cconference. This is beautiful hotel and the location is perfect. The Art Institute and Millenium Park is a short 2 block walk from the hotel. My room was small, but beautiful. The bed was comfy and there were plenty of pillows. Alas, no coffee maker in the room…The staff was helpful and friendly. I’d stay here again!



Now comes the first analysis of the text. The first thing to examine in any textual analysis is the so called bag-of-words model. We break up the reviews into a set of words, and then we analyze those single words.

###
# FIX CONTRACTIONS like won't can't, etc.
# function to expand contractions in an English-language source
fix.contractions <- function(doc) {
    # "won't" is a special case as it does not expand to "wo not"
    doc <- gsub("won't", "will not", doc)
    doc <- gsub("can't", "can not", doc)
    doc <- gsub("n't", " not", doc)
    doc <- gsub("'ll", " will", doc)
    doc <- gsub("'re", " are", doc)
    doc <- gsub("'ve", " have", doc)
    doc <- gsub("'m", " am", doc)
    doc <- gsub("'d", " would", doc)
    # 's could be 'is' or could be possessive: it has no expansion
    doc <- gsub("'s", "", doc)
    return(doc)
}

cleanup_review <- function(aReviewStr) {
    the_cleansed_string <- aReviewStr
    the_cleansed_string <- fix.contractions(the_cleansed_string)
    the_cleansed_string <- gsub("[^a-zA-Z0-9 ]", " ", the_cleansed_string)
    theTokens <- tokenize_words(the_cleansed_string, stopwords = stopwords::stopwords("en"))[[1]]
    theLongerTokens  <- theTokens[sapply(theTokens, function(aToken) (nchar(aToken) > 3))]
    the_cleansed_string <- paste(theLongerTokens, collapse = " ")
    return(the_cleansed_string)
}
training.df$Filtered_Reviews <- sapply(training.df$Hotel_Reviews, cleanup_review)

tokenized_unigram.df <- training.df %>% 
    unnest_tokens(word, Filtered_Reviews) %>%
    distinct() 

tokenized_bigram.df <- training.df %>% 
    unnest_tokens(ngram, Filtered_Reviews, token = "ngrams", n = 2, collapse=FALSE) # %>%

Preliminary distributions

Zipf’s law says that the more common a word is the shorter it is. It also says that the frequency of that word has a “Zipfian” distribution - common words (typically the top 2000-3000 words in any language) have a very high occurence/probability, and then the rest rapidly drop off. As a quick sanity check of the data, let’s see if the reviews have a “Zipfian” distribution.

word_frequency <- tokenized_unigram.df %>%
    dplyr::count(word, sort = TRUE) 
freq_range = 1:1000
barplot(word_frequency$n[freq_range])

Looks Zipfian!

Sentiment analysis

The first 800 reviews are “negative polarity”, meaning they dislike the hotel, and then the next 800 are “positive polarity” meaning they like the hotel. We can perform a “sentiment” analysis (determining whether a writer’s attitude is positive, negative, or neutral) on the reviews to see if the data seems reasonably labeled for thumbs-up/down polarity. Here we use the “bing” sentiment dictionary to categorize words as harsh or kind.

reviewssentiment <- tokenized_unigram.df %>%
    inner_join(get_sentiments("bing"), by = "word")  %>%
    count(training_files, sentiment) %>%
    spread(sentiment, n, fill = 0) %>%
    mutate(sentiment = positive - negative)

negative_sentiment_average <- mean(reviewssentiment$sentiment[1:799])
positive_sentiment_average <- mean(reviewssentiment$sentiment[801:1599])
ggplot(data=reviewssentiment[1:799,], aes(reviewssentiment$sentiment[1:799])) + 
        geom_histogram(binwidth = 1) +
        xlab("Sentiment Level") +
        labs(title = "Sentiment for Negative Reviews") +
        publication_theme()

ggplot(data=reviewssentiment[801:1599,], aes(reviewssentiment$sentiment[801:1599])) + 
        geom_histogram(binwidth = 1) +
        xlab("Sentiment Level") +
        labs(title = "Sentiment for Positive Reviews") +
        publication_theme()

Negative reviews have an average sentiment level of -0.29, positive ones an average of 7.4, so it seems likely the data has been correctly labeled as positive or negative “thumbs-up”. Even when they are being negative, people tend to be kinder than harsher. Good. 😁

Deceptive words vs truthful words

Let’s plot the probability that words show up in reviews, by whether they show up in truthful, or deceptive reviews, or both.

Words that show up ONLY in deceptive reviews or ONLY in truthful reviews

word_distribution <- as.tibble(count(tokenized_unigram.df, word, Truthfulness, sort = TRUE))
tidy_word_distribution <- spread(word_distribution, Truthfulness, n) 
tidy_word_distribution <- tidy_word_distribution %>% 
    replace_na(list(deceptive = 0, truth = 0)) %>% 
    mutate(deceptive_proportion = deceptive / sum(deceptive))  %>%
    mutate(truth_proportion = truth / sum(truth))

deceptive_words_only <- tidy_word_distribution %>%
    filter(truth_proportion == 0) %>%
    filter(deceptive > 2) %>%
    mutate(word = reorder(word, deceptive))   #Reorders word into a factor, based on n....
deceptive.barplot <- ggplot(deceptive_words_only, aes(word, deceptive)) +
        geom_col() +
        xlab(NULL) +
        coord_flip() +
        labs(title = "Words that appear ONLY in deceptive reviews") +
        theme_economist() +
        theme(text=element_text(family="Rockwell"),
                plot.title = element_text(family="Rockwell", size=12),
                axis.text.x = element_text(size=rel(1)),
                axis.text.y = element_text(size=rel(1)))
deceptive.barplot

true_words_only <- tidy_word_distribution %>%
    filter(deceptive_proportion == 0) %>%
    filter(truth > 2) %>%
    mutate(word = reorder(word, truth)) #Reorders word into a factor, based on n....
true.barplot <- ggplot(true_words_only, aes(word, truth)) +
        geom_col() +
        xlab(NULL) +
        coord_flip() +
        labs(title = "Words that appear ONLY in  truthful reviews") +
        publication_theme()
true.barplot

There are words that only show up in deceptive reviews, and similarly there are words that only show up in truthful reviews: As you can observe, the deceptive reviews are slightly richer on superlative words, and the truthful reviews seem richer on nouns. This implies that parts-of-speech tagging might be a useful part of our approach.

Words that show up in BOTH truthful and deceptive reviews

How about words that show up in both truthful and deceptive reviews?. Looks like the same thing, nouns show up in the top left truthful section, adjectives in the bottom right deceptive section. Looks like people who write deceptive reviews like the words luxury, accomodations, amazing and smell.

Deceptive reviews smell bad! And they are “amazing”. LOL 😂

# Remove words that occur only on axes from the plot
plot_word_distribution <- tidy_word_distribution %>%
    filter((truth_proportion > 0) & (deceptive_proportion > 0))

spam.plot <- ggplot(plot_word_distribution, 
       aes(x = deceptive_proportion, y = truth_proportion )) +
    geom_abline(color = "gray40", lty = 2) +
    geom_jitter(color = "red", alpha = 0.3, size = 2., width = 0.2, height = 0.2) +
    geom_text(aes(label = word), check_overlap = TRUE, size = 3, fontface = "bold", vjust = 1.5) +
    scale_x_log10(labels = percent_format()) +
    scale_y_log10(labels = percent_format()) +
    # xlim(-0.001, 0.002) + ylim(-0.001, 0.002) +
    theme(legend.position="none") +
    labs(y = "% Truthful (log scale)", x = "% Deceptive (log scale)") +
    publication_theme()
spam.plot

Now would be a good time to quote the authors of the original study:

However, that deceptive opinions contain more superlatives is not unexpected, since deceptive writing (but not necessarily imaginative writing in general) often contains exaggerated language.

With some insight into the distribution and type of words used in spam review, let’s build a machine learning algorithm. The original authors got a 90% accuracy match with their SVM. Again a quote is in order:

Notably, a combined classifier with both n-gram and psychological deception features achieves nearly 90% cross-validated accuracy on this task. In contrast, we find deceptive opinion spam detection to be well beyond the capabilities of most human judges, who perform roughly at-chance—a finding that is consistent with decades of traditional deception detection research (Bond and DePaulo, 2006).

So… astonishingly, human intelligence has a roughly 50% accuracy at detecting deception (no better than random). This explains much to me.

Lets see if machine-learning can do better 😉

Feature Engineering

We will need to extract more information from the words we have at our disposal. The authors of the study created two sets of parameters for their machine-learning “hyperspace” - one is the parts-of-speech tagging for a review (adjectives, nouns, etc.).

You might be surprised to learn that identifying a word’s part of speech is a fairly classic AI technology, and even very simple approaches routinely get 90% accuracy. The other thing the author’s discovered is using bigrams/n-grams instead of single words. This means they looked at paired words. For example the previous sentence, instead of being broken down into {this, means, they, looked, at, paired, words} breaks down into two word combinations. {this means, means they, they looked, looked at, at paired, paired words}

Let’s perform these bits of “feature engineering” - creating additional features from the text to improve our machine learning success. First lets tag the words with their parts-of-speech.

Testing spacyr - an open-source Parts-of_Speech (POS) tagger

Several open-source parts-of-speech (POS) taggers are available. I’m using spacyr. Trained using a neural net, this POS tagger has a 92% accuracy, and is a (relatively) fast classifier - important considering the amount of data we are mining. As an example of its use, let’s parse a joke:

A Texan, a Russian and a New Yorker go to a restaurant in London.
The waiter tells them, “Excuse me – if you were going to order the steak, I’m afraid there’s a shortage due to the mad cow disease.”

The Texan says, “What’s a shortage?”
The Russian says, “What’s a steak?”
The New Yorker says, “What’s ‘excuse me’?”

txt <- c(line1 = "A Texan, a Russian and a New Yorker go to a restaurant in London.",
         line2 = "The waiter tells them, Excuse me -- if you were going to order the steak, I'm afraid there's a shortage due to the mad cow disease.",
         line3 = "The Texan says, What's a shortage?",
         line4 = "The Russian says, What's a steak?",
         line4 = "The New Yorker says, What's 'excuse me?")

# process documents and obtain a data.table
parsedtxt <- spacy_parse(txt)
parsedtxt$sentence_id <- NULL
# Show label and review
kable(parsedtxt[1:20,], format = "markdown")
doc_id token_id token lemma pos entity
line1 1 A a DET
line1 2 Texan texan PROPN NORP_B
line1 3 , , PUNCT
line1 4 a a DET
line1 5 Russian russian PROPN NORP_B
line1 6 and and CCONJ
line1 7 a a DET ORG_B
line1 8 New new PROPN ORG_I
line1 9 Yorker yorker PROPN ORG_I
line1 10 go go VERB
line1 11 to to ADP
line1 12 a a DET
line1 13 restaurant restaurant NOUN
line1 14 in in ADP
line1 15 London london PROPN GPE_B
line1 16 . . PUNCT
line2 1 The the DET
line2 2 waiter waiter NOUN
line2 3 tells tell VERB
line2 4 them -PRON- PRON

Tagging the hotel reviews

We’ll first tag the reviews And then clean them up by removing information-free words like I, me, the, etc.

# Create a list of docId, review (spacyr's input format for text)
text_data <- c()
text_data[training.df$training_files] <- training.df$Hotel_Reviews
reviews.pos.raw <- spacy_parse(text_data)

# Standardize format
names(reviews.pos.raw)[names(reviews.pos.raw)=="token"] <- "word"
names(reviews.pos.raw)[names(reviews.pos.raw)=="doc_id"] <- "training_files"
reviews.pos.raw$token_id <- NULL

# Remove all tokens less than 3 characters and remove stop words
reviews.pos <- reviews.pos.raw %>%
    filter(nchar(word) > 3) %>%
    filter(pos!="PART") %>%
    anti_join(stop_words)

reviews.df <- inner_join(training.df, reviews.pos, by="training_files")
reviews.df$Hotel_Reviews <- NULL
reviews.df$Filtered_Reviews <- NULL
kable(head(reviews.df[,2:8]), format = "markdown")
Truthfulness Likes sentence_id word lemma pos entity
deceptive negative 1 stayed stay VERB
deceptive negative 1 Schicago schicago PROPN ORG_I
deceptive negative 1 Hilton hilton PROPN ORG_I
deceptive negative 1 days day NOUN DATE_I
deceptive negative 1 nights night NOUN TIME_I
deceptive negative 1 conference conference NOUN

Confirming our parts of speech hypothesis

Now let’s confirm our parts of speech hypothesis Let’s look at the parts of speech distribution for both truthful and deceptive reviews

reviews.pos.count <- reviews.df %>%
                        group_by(Truthfulness) %>%
                        dplyr::count(pos,sort=TRUE)

kable(reviews.pos.count[order(reviews.pos.count$pos, reviews.pos.count$Truthfulness),], 
      format = "markdown")
Truthfulness pos n
deceptive ADJ 6507
truth ADJ 6704
deceptive ADP 426
truth ADP 267
deceptive ADV 2278
truth ADV 2155
deceptive CCONJ 7
truth CCONJ 12
deceptive DET 265
truth DET 303
deceptive INTJ 48
truth INTJ 48
deceptive NOUN 17616
truth NOUN 19959
deceptive NUM 83
truth NUM 266
deceptive PRON 164
truth PRON 181
deceptive PROPN 3996
truth PROPN 3775
deceptive PUNCT 8
truth PUNCT 88
deceptive VERB 8478
truth VERB 8398
deceptive X 2
truth X 14

Hmmn… Compared to deceptive reviews, truthful reviews have:

  • Significantly more nouns
  • Significantly less adverbial particles (nearby, while, after, before, etc.)
  • Significantly more numbers
  • Somewhat less proper names
pos.barplot <- ggplot(reviews.pos.count, aes(pos,n, fill=Truthfulness)) +
        geom_col(position="dodge") +
        coord_flip() +
        xlab("Number of times POS appears") +
        labs(title = "Truthful and Deceptive Reviews - Parts-of-speech Distribution") +
        theme_economist() +
        theme(text=element_text(family="Rockwell"), 
              plot.title = element_text(family="Rockwell", size=12))
pos.barplot

Next Steps:

This will end the Exploratory Analysis part of this project. Coming Up is the Machine Learning phase